Setup

Part 1: Data

In this analysis we will try to gain insights on trends associated with presidential voting in the United States using the Behavioral Risk Surveillance System (BRFSS) data.

Although that the BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, it contains valuable demographic and social data that are not solely related to health risks.

The analysis presented here is observational, and we cannot establish the presence of causal relationships based on it.

BRFSS is gathered using a complex multi-stage sampling which allows us to somewhat generalize our conclusions over all phone users in the United States.

Possible sampling biases in this analysis are : Voluntary response bias. However, this bias is mitigated by the fact that individuals where called multiple times.

A correct analysis requires accounting for the sampling weights during calculations.


Part 2: Research questions

In the first question we will try to find the variables that are strongly related to voting. The second and third questions arouse during the explantory analysis of the first question.

Research quesion 1: Which variables are associated the most with voting?

Research quesion 2: What is the relationship between age, smoking and voting?

Research quesion 3: How is voting proportion distribbuted across income and state?


Part 3: Exploratory data analysis

Research quesion 1:

The field scntvot1 in the survey answers the following question: Did you vote in the last presedential elections?

Let us prepare the data, by 1- dropping all rows with na for scntvot1, 2- dropping any column with more than 20% missing values, 3- imputing the missing values for other parameters with the median, 4- adding numOfVotes encoding yes for voting as 1 and not voting as 0. 5- encoding categorial variable as numerical, 6- removing columns with constant values (done after imputation to avoid considering a column with na a variable)

In order to decide what factors among the studied in BRFSS affect voting the most, we calculate the weigted correlation matrix

We will check the variables that are have a corelation larger than 0.15 with voting for further investigation.

##   scntvot1 numOfVotes    X_age80    X_age_g  X_ageg5yr   hlthpln1 
##  1.0000000  1.0000000  0.2609374  0.2595597  0.2544495  0.2520316 
##   bloodcho  X_hispanc     qstver  X_smoker3 
##  0.2490440  0.1920012  0.1792830  0.1507625

Among the highly correlated variables we find: a- two age-related variabls X_age_g X_ageg5yr b- two healthcare-related:healthpln1,bloodcho and c- One race-related variable X_hispanic.

To look into possible relations with these variables we start by calculating and visualizing the aggregated weighted mean of voting over each variable as in the two plots below.

Calculations:

Visualization

We can see that voting is higher in: 1- non-Hispanic population, 2- people with health plans, 3- non-smokers and especially former smokers, 4- older age.

Since voting is high in former smokers, the second research question we chose is the relationship between age, smoking and voting. Quitting smoking is a chronological process and we might find some interesting pattern across age segments.

Having a health plan might be a sign of having proper financial means, To further investigate the relationship between financial situation and voting, the third research question will be : How voting proportion is distributed across income and state.

Research quesion 2:

What is the relationship between age, smoking and voting? We aggregate the mean of voting as a function of smoking again, but this time divided between different age segments.

Contrary to our premise, results show that, regardless of age, we can see that non-smokers tend to vote more than smokers.

Research quesion 3:

How income and location affect voting proportion? The actual worth of income depends on the location, so we try to see if the patter holds

We can see that across all the states, higher income is associated with a higher voting ratio.

Summary

BRFSS data indicates that the proportion of people who vote increase with age and income. We also found that non-smokers tend to vote more than smokers across all age segments.

Hispanic, Latinos and Spanish speaking people votes significantly less than the rest of the population.

The data also shows a positive association between the income in a certain location and the propability of voting.